(54) A reconfigurable signal processor with embedded flash memory device



(57) The present invention relates to a dynamically
reconfigurable processing unit (1) including an embedded
Flash memory device (4) for non-volatile storage of
code, data and bit-streams, the unit (1) being integrated
into a single chip together with a microprocessor (2)
core. Advantageously, the processing unit further
comprises an S-RAM based embedded FPGA unit
structured for FPGA reconfigurations having a specific
programming interface (7) connected to a port (FP) of said
Flash memory device (4) through a DMA channel (8).
Description

Field of invention 

[0001] The present invention relates to a dynamically 
reconfigurable processing unit tightly connected to a 
Flash EEPROM memory subsystem. 
[0002] More specifically, the invention relates to 
reconfigurable signal processing IC with an embedded 
Flash memory device for non-volatile storage of code, 
data and bit-streams, the unit being integrated into a sin- 
gle chip together with a microprocessor core. 

Prior art 

[0003] As is well known by those skilled in this technical
field, increasing complexity of system design and shorter
time-to-market requirements are leading research towards
the investigation of hybrid systems including
processors enhanced by programmable logic.
[0004] In this respect, reference is made to the work
by Young-Don Bae et al., "A Single-Chip Programmable
Platform Based on a Multithreaded Processor and
Configurable Logic Clusters", ISSCC 2002 Digest of
Technical Papers, pp. 336-337, Feb. 2002.
[0005] A further reference is the article by Zhang et al.
entitled "A 1 V Heterogeneous Reconfigurable Processor IC
for Baseband Wireless Applications", ISSCC 2000 Digest of
Technical Papers, pp. 68-69, 488, Feb. 2000.
[0006] At the same time, rising costs of mask sets
and the shorter time-to-market available for new products
are leading to the introduction of systems with a higher
degree of programmability and configurability, such as
systems-on-chip with configurable processors, embedded
FPGA and embedded flash memory.
[0007] Moreover, the availability of an advanced em- 
bedded flash technology, based on NOR architecture, 
together with innovative IP's, like embedded flash mac- 
rocells with special features, is a key factor. 
[0008] For a better understanding of the present in- 
vention reference is also made to the Field Programma- 
ble Gate Array (FPGA) technology combining standard 
processors with embedded FPGA devices. 
[0009] These solutions allow exactly the required
peripherals to be configured into the FPGA at deployment
time, exploiting temporal re-use by dynamically
reconfiguring the instruction set at run time based on
the currently executed algorithm.

[0010] The existing models for designing FPGA/processor
interaction can be grouped in two main categories:

the FPGA is a co-processor communicating with the
main processor through a system bus or a specific
I/O channel;

the FPGA is described as a function unit of the
processor pipeline.

[0011] The first group includes the GARP processor,
known from the article by T. Callahan, J. Hauser, and J.
Wawrzynek entitled "The Garp architecture and C
compiler", IEEE Computer, 33(4): 62-69, April 2000. A
similar architecture is provided by the A-EPIC processor
that is disclosed in the article by S. Palem and S. Talla
entitled "Adaptive explicit parallel instruction
computing", Proceedings of the fourth Australasian
Computer Architecture Conference (ACOAC), January 2001.
[0012] In both cases the FPGA is addressed via dedicated
instructions, moving data explicitly to and from
the processor. Control hardware is kept to a minimum,
since no interlocks are needed to avoid hazards, but a
significant overhead in clock cycles is required to
implement communication.

[0013] Only when the number of cycles per execution
of the FPGA is relatively high may the communication
overhead be considered negligible.
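
The trade-off described in the two preceding paragraphs can be illustrated with a minimal sketch. The function name and all cycle counts below are hypothetical and chosen only for illustration; they are not taken from the cited architectures.

```python
def offload_speedup(sw_cycles, fpga_cycles, transfer_cycles):
    """Speed-up of offloading one call to a coprocessor-style FPGA.

    sw_cycles:       cycles the processor alone would need (hypothetical)
    fpga_cycles:     cycles the FPGA needs for the same work (hypothetical)
    transfer_cycles: cycles spent moving operands and results explicitly
    """
    return sw_cycles / (fpga_cycles + transfer_cycles)


# Short kernel: the fixed transfer cost dominates and offloading is a loss.
short = offload_speedup(sw_cycles=40, fpga_cycles=4, transfer_cycles=60)

# Long kernel: same 10x raw advantage, same transfer cost, now amortized.
long_ = offload_speedup(sw_cycles=40_000, fpga_cycles=4_000, transfer_cycles=60)

print(f"short kernel speed-up: {short:.2f}x")
print(f"long kernel speed-up:  {long_:.2f}x")
```

With a fixed transfer cost, the speed-up only approaches the raw hardware advantage once each FPGA invocation runs for many cycles.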

[0014] In the commercial world, FPGA suppliers such
as Altera Corporation offer digital architectures based
on US Patent No. 5,968,161 to T.J. Southgate, "FPGA
based configurable CPU additionally including second
programmable section for implementation of custom
hardware support".

[0015] Other suppliers (Xilinx, Triscend) offer chips
containing a processor embedded on the same silicon
IC with embedded FPGA logic. See for instance US
Patent 6,467,009 to S.P. Winegarden et al., "Configurable
Processor System Unit", assigned to Triscend
Corporation.

[0016] However, those chips are generally loosely
coupled by a high speed dedicated bus, performing as
two separate execution units rather than being merged
in a single architectural entity. In this manner the FPGA
does not have direct access to the processor memory
subsystem, which is one of the strengths of the academic
approaches outlined above.

[0017] In the second category (FPGA as a function
unit) we find architectures commercially known as
"PRISC", "Chimaera" and "ConCISe".
[0018] In all these models, data are read and written
directly on the processor register file, minimizing
overhead due to communication. In most cases, to minimize
control logic and hazard handling and to fit in the
processor pipeline stages, the FPGA is limited to
combinatorial logic only, thus severely limiting the
performance boost that can be achieved.

[0019] These solutions represent a significant step
toward a low-overhead interface between the two entities.
Nevertheless, due to the granularity of FPGA operations
and its hardware oriented structure, their approach is
still very coarse-grained, reducing the possible resource
usage parallelism and again including hardware issues
neither familiar nor friendly to software compilation
tools and algorithm developers.

[0020] Thus, a relevant drawback in this approach is
the memory data access bottleneck, which often forces
long stalls on the FPGA device in order to fetch on
the shared registers enough data to justify its activation.
[0021] The technical problem of the present invention
is that of providing a new kind of reconfigurable
processing unit tightly connected to a memory architecture
having functional and structural features capable of
offering significant performance and energy consumption
enhancements with respect to a traditional signal
processing device.

Summary of invention 

[0022] The invention overcomes the limitations of 
similar preceding architectures relying on an embedded 
device of different nature, and a new approach to proc- 
essor/memory interface. 

[0023] According to a first embodiment of the present
invention, the reconfigurable processing unit targets
image-voice processing and recognition application
domains by joining a configurable and extensible
processor core and an SRAM-based embedded FPGA.
[0024] More specifically, the processing unit according
to the invention further includes an S-RAM based
embedded FPGA unit structured for FPGA reconfigurations
having a specific programming interface connected
to a port FP of said Flash memory device through a
DMA channel.

[0025] The features and advantages of the process- 
ing unit according to this invention will become apparent 
from the following description of a best mode for carrying 
out the invention given by way of non-limiting example 
with reference to the enclosed drawings. 

Brief description of the drawings 

[0026] 

Figure 1 is a block diagram of a processing unit
architecture for data processing according to the
present invention;

Figure 2 is a block diagram of a Flash memory ar- 
chitecture embedded into the processing unit of Fig- 
ure 1; 

Figure 3 is a schematic view of system memory hi- 
erarchy provided by the present invention; 

Figure 4 is a block diagram of a specific processor
extension, for instance an example of added DSP
instructions;

Figure 5 is a block diagram of a further specific
processor extension, for instance an optimized
fixed-point calculation of the square root;

Figure 6 is a table view showing the overall performance
improvements for a face recognition task implemented
by the processing unit of the present invention;

Figure 7 is a schematic chip micrograph.

Detailed description

[0027] With reference to the drawings, generally
shown at 1 is a processing unit realized according
to the present invention for digital signal processing
based on reconfigurable computing.
[0028] The processing unit 1 includes an embedded
Flash memory device 4 for non-volatile storage of code,
data and bit-streams and a further S-RAM based embedded
FPGA unit 3 realized for the configuration purposes
of the present invention.
[0029] More specifically, an 8Mb application-specific
embedded flash memory device 4 is disclosed. The
memory device 4 is integrated into a single chip together
with a microprocessor 2 and the FPGA structure 3.
[0030] Advantageously, application-specific hardware
units are added and dynamically modified by
reconfiguration of the embedded FPGA 3. By implementing
application-specific vector processing instructions the
processing unit 1 shows a peak computing power of
1GOPS.

[0031] Efficient read-write-erase access to code, data
and FPGA bitstreams is provided by the Flash memory
device 4 based on a modular 8Mb, 4-bank Flash memory,
as will be more clearly explained hereinafter.
[0032] The processing unit 1 comprises three
content-specific I/O ports and delivers an aggregate peak
read throughput of 1.2GB/s.
[0033] The system architecture 1 is illustrated in
Figure 1.

[0034] The functional purposes of the embedded
FPGA 3 are:

i) extension of the processor datapath supporting a
set of additional special-purpose C-callable
microprocessor instructions;

ii) bus-mapped coprocessors, connected to the system
bus through a master/slave interface;

iii) flexible I/O to connect external units or sensors
with application-specific communication protocols.

[0035] Even though such different circuit purposes
would require different kinds of programmable logic for
best implementation of either arithmetic-dominated or
control-dominated logic, a single programmable logic
subsystem 3 has been implemented to be shared
among different purposes both in space (same
configuration) and time (subsequent configurations).
[0036] The single, high I/O count, fine-grain e-FPGA
3 operates as a datapath for the microprocessor pipeline
and as dedicated control logic for bus coprocessor and
I/O control interface. The FPGA has a specific
programming interface 7 connected to a port FP of said Flash
memory device 4 through a DMA channel 8.
[0037] FPGA reconfiguration is concurrent with
software execution.

[0038] A local bus 6 connects a dedicated 32-bit Flash 
memory port FP to the FPGA programming interface 7. 
[0039] A DMA channel 8 handles the bitstream transfer,
while the microprocessor fetches instructions and data
from different Flash memory ports: a 64-bit wide code port
(CP) and a data port (DP).

[0040] To support streaming applications a 1kB
dual-port buffer 9 is used to interface fast decoding
hardware and slower software running on the processor 2.
[0041] The memory sub-system architecture is shown
in Figure 2.

[0042] The modular structure of the memory (dotted 
line) includes: 

charge pumps 10 (Power Block); 

testability circuits 11 (DFT); 

a power management arbiter 12 (PMA); and, 

a customizable array 13 of N independent 2Mb flash
memory modules 16.

[0043] Depending on the storage requirements the 
number N may be chosen; N=4 in the current implemen- 
tation. 

[0044] The modular memory features (N+2) 128-bit
target ports and implements an N-bank uniform memory
13.

[0045] As previously mentioned, three content-specific
ports are dedicated to code (CP, 64-bit wide), data
(DP, 64-bit) and FPGA bit stream configurations (FP,
32-bit). A 128-bit sub-system crossbar 15 connects all
the architecture blocks and the eight-bit microprocessor
2.

[0046] The main features of the flash memory
device 4 are: charge pump 10 sharing among different
flash memory modules 16 through the PMA arbiter 12
in a multi-bank fashion; the use of a small
eight-bit microprocessor 2 to ease memory system test
and to add complex functionalities for data management;
and the use of an ADC (Analog-to-Digital Converter),
required by the application, to increase system
self test capability.

[0047] The third port FP of the Flash device 4 is
dedicated to managing embedded-FPGA (e-FPGA) configuration
data stored in flash memory modules. The FP
port is read-only and provides fast sequential access for
bit stream downloading.

[0048] The FP port has four configuration registers
replicating the information stored in the CP port that must
be used in order to write e-FPGA configuration data.



[0049] The output data word bus and the address bus
are 32 bits wide. The FP port uses a chip select to access
the addressable memory space, and a burst enable
to allow burst serial access.
[0050] In a read operation, an output ready signal is tied
low when data are not immediately available, so that it
can act as a wait state signal.
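
The read behaviour of the FP port described above can be sketched as a small cycle-level simulation. This is only an illustrative model under stated assumptions: the function name, the `busy_cycles` parameter and the one-word-per-ready-cycle timing are hypothetical and are not specified in the text.

```python
def fp_burst_read(memory, start, length, busy_cycles=frozenset()):
    """Simulate a burst read on the FP port; returns (words, total_cycles).

    memory:      list of 32-bit words standing in for the FP partition
    busy_cycles: cycle indices on which output-ready is held low
                 (hypothetical; models data not being immediately available)
    """
    words, cycle, addr = [], 0, start
    while len(words) < length:
        ready = cycle not in busy_cycles
        if ready:
            words.append(memory[addr] & 0xFFFFFFFF)  # 32-bit output word
            addr += 1      # burst enable: the address advances sequentially
        cycle += 1         # a low ready signal costs one wait state
    return words, cycle


# Hypothetical 16-word flash image; ready is low on cycles 0 and 3.
flash = [0x1000 + i for i in range(16)]
data, cycles = fp_burst_read(flash, start=4, length=4, busy_cycles={0, 3})
print([hex(w) for w in data], cycles)
```

In this model the reader simply stalls while the ready signal is low, so a four-word burst with two wait states takes six cycles.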
[0051] The eight-bit microprocessor 2 (uP) performs
additional complex functions (defragmentation,
compression, virtual erase, etc.) not natively supported by
the DP port, and assists in built-in self test of the
memory system.

[0052] The (N+2)x4 128-bit crossbar 15 connects the
modular memory with the four initiators (CP, DP, FP and
uP), ensuring that at least three flash memory modules
16 can be read in parallel at full speed.
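
As a minimal sketch, the figures given for the modular memory (2Mb modules, N+2 target ports, four initiators) can be derived from N alone. The helper name and the dictionary layout are hypothetical.

```python
def memory_config(n_modules, module_mbit=2, initiators=("CP", "DP", "FP", "uP")):
    """Derive capacity and crossbar dimensions from the module count N."""
    return {
        "capacity_mbit": n_modules * module_mbit,          # N x 2Mb
        "target_ports": n_modules + 2,                     # (N+2) target ports
        "crossbar": (n_modules + 2, len(initiators)),      # (N+2) x 4
    }


cfg = memory_config(4)  # N=4 in the current implementation
print(cfg)
```

For N=4 this gives the 8Mb capacity and a 6x4 crossbar.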
[0053] The memory space of the four modules 16 is
arranged in three programmable user-defined partitions,
each one devoted to a port. The memory system
clock can run up to 100MHz, and reading three modules
16 with a 128-bit data bus and 40ns access time results
in a peak read throughput of 1.2GB/s.
[0054] Each 2Mb flash memory module 16 has a 
128-bit IO data bus with 40ns access time, resulting in 
400Mbyte/s, and a program/erase control unit. Simulta- 
neous memory operations use the power management 
arbiter 12 (PMA) for optimal scheduling. 
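
The throughput figures quoted above follow from the bus width and access time. The short calculation below reproduces them; the constant names are illustrative only.

```python
BUS_WIDTH_BITS = 128     # per-module data bus width, from the text
ACCESS_TIME_NS = 40      # per-module access time, from the text
PARALLEL_MODULES = 3     # modules that can be read in parallel at full speed

bytes_per_access = BUS_WIDTH_BITS // 8                       # 16 bytes
# bytes per nanosecond equals GB/s; multiply by 1000 to express it in MB/s
module_mb_per_s = bytes_per_access * 1000 // ACCESS_TIME_NS
aggregate_mb_per_s = module_mb_per_s * PARALLEL_MODULES

print(module_mb_per_s, "MB/s per module")
print(aggregate_mb_per_s, "MB/s aggregate")
```

Sixteen bytes every 40ns is 400MB/s per module, and three modules read in parallel give the 1.2GB/s aggregate peak.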
[0055] Available power and user-defined priorities are 
considered to schedule conflicting resource requests in 
a single clock cycle. 

[0056] The memory device 4 allows up to four simultaneous
operations, with a limit of one each for write and
erase.
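
The scheduling rules in the preceding paragraphs can be sketched as a simple priority arbiter. This is an illustrative model, not the disclosed PMA circuit: the `Request` fields, the power units, the convention that a lower priority value wins, and the reading of the limit as one write plus one erase are all assumptions. Conflicts between requests targeting the same module are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Request:
    port: str      # "CP", "DP", "FP" or "uP"
    op: str        # "read", "write" or "erase"
    priority: int  # user-defined; lower value wins (assumption)
    power: int     # charge-pump budget units needed (hypothetical)


def arbitrate(requests, power_budget, max_ops=4):
    """Grant requests in priority order within power and operation limits."""
    granted, used = [], 0
    for req in sorted(requests, key=lambda r: r.priority):
        if len(granted) == max_ops:
            break  # at most four simultaneous operations
        if req.op in ("write", "erase") and any(g.op == req.op for g in granted):
            continue  # at most one write and one erase at a time
        if used + req.power > power_budget:
            continue  # not enough pump capacity left for this request
        granted.append(req)
        used += req.power
    return granted


pending = [
    Request("CP", "read", priority=0, power=1),
    Request("DP", "write", priority=1, power=4),
    Request("uP", "write", priority=2, power=4),
    Request("FP", "read", priority=3, power=1),
    Request("uP", "erase", priority=4, power=5),
]
grants = arbitrate(pending, power_budget=8)
print([(g.port, g.op) for g in grants])
```

Here the second write is rejected by the one-write limit and the erase by the remaining power budget, while both reads proceed.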

[0057] Figure 3 depicts the memory hierarchy and
parallelism across the processing unit 1. The ports CP
and DP are interfaced to the 64-bit, 800MB/s AHB
system bus 6.

[0058] At a system clock rate of 100MHz each I/O port
can independently operate at maximum speed. So, an
aggregate peak read rate of 1.2GB/s can be sustained
as it is limited by memory access time.
[0059] In the current implementation the e-FPGA
reconfiguration takes 500µs at 100MHz; 50MB/s average
throughput out of the available 400MB/s is currently
sustained by the e-FPGA configuration interface
7.
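
A short calculation relates the reconfiguration figures above. The bitstream volume and the best-case time computed here are derived values, not figures stated in the text, and the constant names are illustrative.

```python
RECONFIG_TIME_US = 500     # reconfiguration time, from the text
SUSTAINED_MB_PER_S = 50    # average sustained throughput, from the text
FP_PORT_BITS = 32          # FP port width, from the text
CLOCK_MHZ = 100            # system clock, from the text

# A 32-bit port delivering one word per cycle at 100MHz moves 400MB/s,
# which is consistent with the "available 400MB/s" quoted in the text.
port_peak_mb_per_s = (FP_PORT_BITS // 8) * CLOCK_MHZ

# MB/s multiplied by microseconds gives bytes (1e6 * 1e-6 = 1).
implied_bitstream_bytes = SUSTAINED_MB_PER_S * RECONFIG_TIME_US

# Time the same transfer would take if the port peak were sustained.
best_case_us = implied_bitstream_bytes / port_peak_mb_per_s

print(port_peak_mb_per_s, implied_bitstream_bytes, best_case_us)
```

Under these assumptions the configuration data moved per reconfiguration is about 25kB, and sustaining the full port bandwidth would shorten the transfer by a factor of eight.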

[0060] System performance is being evaluated for an 
image processing application (facial recognition) and a 
speech recognition application. 
[0061] More than 20 specific instructions were de- 
signed as C/assembly-callable functions, automatically 
translated to RTL, then synthesized and mapped to the 
e-FPGA. 

[0062] Figures 4 and 5 show two examples of specific 
microprocessor extensions. 

[0063] Figure 4 relates to an eight-issue, eight-bit L2
calculation, which accounts for 23 eight-bit arithmetic
operations and six 64-bit operations, requiring about
10k ASIC equivalent gates.
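
As a software reference only, the sketch below assumes that the "L2 calculation" is a squared Euclidean distance over eight eight-bit lanes; the text does not spell this out, and the function name is hypothetical. Under that reading, eight subtractions, eight multiplications and seven additions total 23 operations, which matches the quoted count, although the text gives no such breakdown.

```python
def l2_squared_8lane(a, b):
    """Squared L2 distance of two vectors of eight unsigned 8-bit values."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("both inputs must have exactly eight lanes")
    if any(not 0 <= v <= 255 for v in list(a) + list(b)):
        raise ValueError("lane values must fit in eight bits")
    diffs = [x - y for x, y in zip(a, b)]   # eight subtractions
    squares = [d * d for d in diffs]        # eight multiplications
    return sum(squares)                     # accumulation of the products


d = l2_squared_8lane([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1])
print(d)
```

In hardware the eight lanes would be processed in parallel by a single extension instruction rather than sequentially as in this reference model.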

[0064] Figure 5 relates to a datapath for an optimized
fixed-point calculation of the square root, which
accounts for twelve 32-bit operations and about 2k ASIC
equivalent gates.
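
For illustration, a fixed-point square root can be computed with shifts, additions and comparisons only. The sketch below uses the standard bit-by-bit (digit recurrence) method and a 16-bit fraction; both choices are assumptions, since the text does not disclose which algorithm or number format the datapath of Figure 5 uses.

```python
def fixed_sqrt(x, frac_bits=16):
    """Square root of an unsigned fixed-point value with frac_bits fraction bits.

    Returns floor(sqrt(x / 2**frac_bits) * 2**frac_bits) using only
    shifts, additions, subtractions and comparisons.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    n = x << frac_bits          # pre-scale so the result keeps frac_bits
    root, bit = 0, 1
    while (bit << 2) <= n:      # find the highest power of four not above n
        bit <<= 2
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root


two = fixed_sqrt(4 << 16)       # sqrt(4.0) in Q16.16
r = fixed_sqrt(2 << 16)         # sqrt(2.0) in Q16.16, rounded down
print(two / 65536, r / 65536)
```

Each loop iteration resolves one result bit, which is why this family of algorithms maps well onto a small combinational or lightly pipelined datapath.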

[0065] The overall performance improvements for the 
face recognition tasks are shown in the table of Figure 6. 
[0066] Execution time is compared for a 32-bit RISC
with basic DSP extensions (MAC, zero-overhead loops,
etc.) and the same processor enhanced with
application-specific instructions.

[0067] Measured speed-ups range from 1.8x to 10.6x
(on the most-demanding task), with an overall improvement
of 8.5x. It must be noticed that switching between
algorithm stages requires only one reconfiguration of
the e-FPGA. Reconfiguration time is negligible.
[0068] The speed-up factors take into account the
possible multi-cycle clock penalty due to processor-FPGA
synchronization in case of instruction extensions
slower than the processor clock. Energy efficiency
figures are reported in Figure 6 too.
[0069] Although the average power consumption of the
system extended with the e-FPGA is slightly higher
(10-15%), the energy reduction for executing each of the
tasks on its specific HW configuration (power-delay
product improvement) results in an overall reduction of
6.7x.

[0070] Only one task showed slightly worse total
execution energy, though showing benefits on execution
speed.
[0071] The last column of Figure 6 reports the
energy-delay improvement of each specific HW configuration
compared to the general-purpose counterpart. Energy
required for e-FPGA reconfiguration is always negligible.
[0072] Measurements show the best energy efficiency
in the range of several MOPS/mW at 1.8V supply. It
lies between conventional ASIP/DSP and dedicated
configurable hardware implementations.
[0073] The full processing unit on a single chip is
implemented in a 0.18µm, 2PL-6ML CMOS embedded
Flash technology. Chip area is 70mm². Technology and
device characteristics are summarized in Figure 6, while
a chip micrograph is shown in Figure 7.



Claims 

1. A dynamically reconfigurable processing unit (1)
including an embedded Flash memory device (4) for
non-volatile storage of code, data and bit-streams,
the unit (1) being integrated into a single chip
together with a microprocessor (2) core, further
comprising an S-RAM based embedded FPGA unit
structured for FPGA reconfigurations having a specific
programming interface (7) connected to a port
(FP) of said Flash memory device (4) through a
DMA channel (8).



2. A dynamically reconfigurable processing unit
according to claim 1, wherein said DMA channel (8)
handles the bitstream transfer while said microprocessor
(2) fetches instructions and data from different
Flash memory ports of said Flash memory device
(4): a wide code port (CP) and a data port (DP).

3. A dynamically reconfigurable processing unit
according to claim 2, wherein said Flash memory device
(4) includes a modular array structure (13)
comprising N memory blocks (16), and wherein a
power block (10), including charge pumps, is
shared among different flash memory modules (16)
through a PMA arbiter (12) in a multi-bank fashion.

4. A dynamically reconfigurable processing unit
according to claim 1, wherein said embedded FPGA
unit (3) exploits the following functions:

iv) extension of the processor datapath supporting
a set of additional special-purpose C-callable
microprocessor instructions;

v) bus-mapped coprocessors, connected to the
system bus through a master/slave interface;

vi) flexible I/O to connect external units or sensors
with application-specific communication
protocols.

5. A dynamically reconfigurable processing unit
according to claim 2, wherein said Flash memory device
(4) includes at least three different access
ports, each for a specific function:

said code port (CP) optimized for random ac- 
cess time and the application system; 

said data port (DP) allowing an easy way to ac- 
cess and modify application data; and, 

said FPGA port (FP) offering a serial access for
a fast download of bit streams for embedded
FPGA (e-FPGA) configurations.

6. A dynamically reconfigurable processing unit ac- 
cording to claim 2, wherein said third port (FP) com- 
prises four configuration registers replicating the in- 
formation stored in said code port (CP) that must be 
used in order to write e-FPGA configurations data. 

7. A dynamically reconfigurable processing unit ac- 
cording to claim 5, wherein said third port (FP) uses 
a chip select to access in the addressable memory 
space and a burst enable to allow burst serial ac- 
cess. 

8. A dynamically reconfigurable processing unit
according to claim 1, wherein said connection between
said interface (7) and said port (FP) is provided
by a local bus (6).

9. A dynamically reconfigurable processing unit
according to claim 5, wherein said Flash memory device
(4) includes four modules (16) each arranged
in at least three programmable user-defined partitions,
each one devoted to a corresponding port.